import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
import pandas as pd
import numpy as np
import pyspark.pandas as ps
import pyspark
import plotly
from pyspark.sql import SparkSession
from math import sqrt
import statistics as st
import matplotlib as mt
import matplotlib.pyplot as plt
import sqlite3
import plotly.graph_objects as go
import plotly.express as px
from pandasql import sqldf
from sklearn import metrics
from sklearn import linear_model
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
# Reading in the sqlite data
cnx = sqlite3.connect(':memory:')
plotly.offline.init_notebook_mode()
You should discuss the goals of the notebook, introduce your data set, and give the source for your data set
The goal of this notebook is to gain a clear understanding of obesity rates in Mexico, Peru, and Colombia based on the metrics collected. The data comes from the UCI Machine Learning Repository's "Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico".
The data set has 17 columns and 2,111 observations. The columns are:
Gender: Patient's gender {object}
Age: Patient's age {float}
Height: Patient's height in meters {float}
Weight: Patient's weight in kilograms {float}
family_history_with_overweight: Whether the patient has a family history of overweight {object}
FAVC: Frequent Consumption of High Caloric Food {object}
FCVC: Frequency of Consumption of Vegetables {float}
NCP: Number of Main Meals (how many meals the patient has daily) {float}
CAEC: Consumption of food between meals {object}
SMOKE: Does the patient smoke {object}
CH2O: Consumption of water in liters {float}
SCC: Does the patient monitor the calories they consume {object}
FAF: How often the patient engages in physical activity {float}
TUE: How often the patient uses technological devices (e.g. phone, video games, TV, computer) {float}
CALC: Consumption of alcohol, i.e. how often the patient drinks {object}
MTRANS: What type of transportation the patient normally uses {object}
NObeyesdad: Patient's weight status {object}
Obesity levels are defined as: Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, and Obesity_Type_III.
Obesity_data = pd.read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")
Obesity_data.to_sql(name='Obesity_data', con=cnx, index = False)
# Doing an initial data pull to understand the data
sample_data = pd.read_sql("""
SELECT *
FROM Obesity_data limit 5
""",cnx)
sample_data
| | Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 21.0 | 1.62 | 64.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 0.0 | 1.0 | no | Public_Transportation | Normal_Weight |
| 1 | Female | 21.0 | 1.52 | 56.0 | yes | no | 3.0 | 3.0 | Sometimes | yes | 3.0 | yes | 3.0 | 0.0 | Sometimes | Public_Transportation | Normal_Weight |
| 2 | Male | 23.0 | 1.80 | 77.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 1.0 | Frequently | Public_Transportation | Normal_Weight |
| 3 | Male | 27.0 | 1.80 | 87.0 | no | no | 3.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 0.0 | Frequently | Walking | Overweight_Level_I |
| 4 | Male | 22.0 | 1.78 | 89.8 | no | no | 2.0 | 1.0 | Sometimes | no | 2.0 | no | 0.0 | 0.0 | Sometimes | Public_Transportation | Overweight_Level_II |
Give a discussion as to what we are generally trying to do with supervised learning where prediction is our goal. Discuss why we want to split our data into a training and test set.
You should also split the data into a training and test set
You should have a narrative that goes through what you are trying to accomplish in the EDA, why you are looking at a particular graph or statistic, and how you interpret what you’ve made. The EDA should be done on the training data only. You should use pandas-on-spark or spark SQL data frames (but matplotlib is fine)
Part of the final’s purpose is to see if you can judge what should and shouldn’t be included in an EDA.
No real assumption is made here prior to observing the data as we have no reason to believe that either gender would be more likely to face obesity than the other.
# Understanding how our population is split by gender
gender_split = pd.read_sql("""
select NObeyesdad,
gender,
count(*) as counts
from Obesity_data
group by 1,2
""",cnx)
gender_split
| | NObeyesdad | Gender | counts |
|---|---|---|---|
| 0 | Insufficient_Weight | Female | 173 |
| 1 | Insufficient_Weight | Male | 99 |
| 2 | Normal_Weight | Female | 141 |
| 3 | Normal_Weight | Male | 146 |
| 4 | Obesity_Type_I | Female | 156 |
| 5 | Obesity_Type_I | Male | 195 |
| 6 | Obesity_Type_II | Female | 2 |
| 7 | Obesity_Type_II | Male | 295 |
| 8 | Obesity_Type_III | Female | 323 |
| 9 | Obesity_Type_III | Male | 1 |
| 10 | Overweight_Level_I | Female | 145 |
| 11 | Overweight_Level_I | Male | 145 |
| 12 | Overweight_Level_II | Female | 103 |
| 13 | Overweight_Level_II | Male | 187 |
Although this is not a huge data set, it is still hard to interpret the results in this format. The visualization below should help us out.
fig = px.pie(gender_split, values='counts', names='Gender')
fig.show()
We see that overall our data is very evenly split when it comes to gender. This shouldn't come as a surprise to us. Further analysis should show if there is a correlation between gender and obesity rate.
fig = px.bar(gender_split, x="NObeyesdad", y="counts",
color='Gender',barmode='group',
height=400)
fig.show()
Immediately we start to see some interesting features of our data. We see the following:
Insufficient Weight: There are more women who are of insufficient weight than men. This could have multiple causes, but one that comes to mind is the social pressure on young women to be thin.
Normal Weight: This is evenly split.
Obesity Type I: This is skewed male, but not overly so.
Obesity Type II: This is predominantly male, which could be attributed to the way BMI is measured, using only weight and height rather than accounting for higher-than-average muscle mass, which many young men tend to have.
Obesity Type III: This is, surprisingly, almost entirely female. Because this is measured with BMI, it might stand to reason that if a man and a woman weigh the same, the woman would likely have a higher BMI due to height differences or assumed muscle-mass differences.
Overweight Level I and II: These categories seem to be fairly evenly split between genders, with males on the heavier side.
An assumption made here is that smoking would correlate with being overweight or obese: the idea that one bad habit could lead to another, as well as the assumption that smokers are less healthy because they smoke and thus might exercise less.
# Understanding how our population is split by smoking status
Smoke_split = pd.read_sql("""
select NObeyesdad,
Smoke,
count(*) as counts
from Obesity_data
group by 1,2
""",cnx)
Smoke_split
| | NObeyesdad | SMOKE | counts |
|---|---|---|---|
| 0 | Insufficient_Weight | no | 271 |
| 1 | Insufficient_Weight | yes | 1 |
| 2 | Normal_Weight | no | 274 |
| 3 | Normal_Weight | yes | 13 |
| 4 | Obesity_Type_I | no | 345 |
| 5 | Obesity_Type_I | yes | 6 |
| 6 | Obesity_Type_II | no | 282 |
| 7 | Obesity_Type_II | yes | 15 |
| 8 | Obesity_Type_III | no | 323 |
| 9 | Obesity_Type_III | yes | 1 |
| 10 | Overweight_Level_I | no | 287 |
| 11 | Overweight_Level_I | yes | 3 |
| 12 | Overweight_Level_II | no | 285 |
| 13 | Overweight_Level_II | yes | 5 |
fig = px.pie(Smoke_split, values='counts', names='SMOKE')
fig.show()
We see that the vast majority of people in our data set do not smoke. For this reason, further analysis on this variable would be hard to visualize without accounting for the near 49:1 ratio of non-smokers to smokers.
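One way to account for that imbalance is to compare the smoker share within each weight category rather than raw counts. A minimal sketch, hard-coding the counts from the Smoke_split table above (pandas assumed available):

```python
import pandas as pd

# (no, yes) smoker counts per weight category, taken from the Smoke_split table.
smoke_counts = {
    "Insufficient_Weight": (271, 1),
    "Normal_Weight": (274, 13),
    "Obesity_Type_I": (345, 6),
    "Obesity_Type_II": (282, 15),
    "Obesity_Type_III": (323, 1),
    "Overweight_Level_I": (287, 3),
    "Overweight_Level_II": (285, 5),
}
smoker_share = pd.DataFrame(
    [{"NObeyesdad": k, "smoker_share": yes / (no + yes)}
     for k, (no, yes) in smoke_counts.items()]
)
print(smoker_share.round(3))
```

Normalizing this way shows the within-category rates rather than letting the overall non-smoker majority dominate the picture.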
Two opposing thoughts here: younger people might be more fit due to their age and potentially being more active; however, obese people might not live to an older age, which would skew the data.
# Understanding how mean age varies by weight category
Age_split = pd.read_sql("""
select NObeyesdad,
AVG(age) as mean_age
from Obesity_data
group by 1
""",cnx)
Age_split
| | NObeyesdad | mean_age |
|---|---|---|
| 0 | Insufficient_Weight | 19.783237 |
| 1 | Normal_Weight | 21.738676 |
| 2 | Obesity_Type_I | 25.884941 |
| 3 | Obesity_Type_II | 28.233785 |
| 4 | Obesity_Type_III | 23.495554 |
| 5 | Overweight_Level_I | 23.417674 |
| 6 | Overweight_Level_II | 26.996981 |
# Helper that draws a vertical line at a group's mean age on the global `fig`.
def lines_on_graph(position, height):
    fig.add_trace(go.Scatter(
        x=[Age_split['mean_age'].iloc[position], Age_split['mean_age'].iloc[position]],
        y=[0, height],
        mode="lines",
        name=Age_split['NObeyesdad'].iloc[position]
    ))
fig = px.histogram(Obesity_data, x="Age")
lines_on_graph(0,300)
lines_on_graph(1,300)
lines_on_graph(2,300)
lines_on_graph(3,300)
lines_on_graph(4,300)
lines_on_graph(5,300)
lines_on_graph(6,300)
fig.show()
We see that our data is right-tailed, with a few records showing people in their 40s, 50s, and even 60s. We also see that obesity and overweight might be correlated with age, since our Insufficient_Weight and Normal_Weight groups are both on the younger side of the distribution compared to the other groups.
Because our data is skewed, the visual above doesn't give us huge insight into how obese people fare later in life. However, a correlation test below will help with this.
"wealth begets wealth" is a common saying. Meaning wealth brings forth more wealth as in wealthy parents might raise a child who in turn will be wealthy. Family History might tell us a lot about someones likelihood of being obese. Here we will explore if "Obesity begets obesity".
# Understanding how our population is split by family history of overweight
family_hisotry_split = pd.read_sql("""
select
NObeyesdad,
family_history_with_overweight,
count(*) as counts
from Obesity_data
group by 1,2
""",cnx)
family_hisotry_split
| | NObeyesdad | family_history_with_overweight | counts |
|---|---|---|---|
| 0 | Insufficient_Weight | no | 146 |
| 1 | Insufficient_Weight | yes | 126 |
| 2 | Normal_Weight | no | 132 |
| 3 | Normal_Weight | yes | 155 |
| 4 | Obesity_Type_I | no | 7 |
| 5 | Obesity_Type_I | yes | 344 |
| 6 | Obesity_Type_II | no | 1 |
| 7 | Obesity_Type_II | yes | 296 |
| 8 | Obesity_Type_III | yes | 324 |
| 9 | Overweight_Level_I | no | 81 |
| 10 | Overweight_Level_I | yes | 209 |
| 11 | Overweight_Level_II | no | 18 |
| 12 | Overweight_Level_II | yes | 272 |
fig = px.bar(family_hisotry_split, x="NObeyesdad", y="counts",
color='family_history_with_overweight',barmode='group',
height=400)
fig.show()
We see that people with a family history of overweight seem to be more likely to be obese themselves. This seems especially true for people who are obese rather than merely overweight.
# Understanding how transportation mode relates to weight category
MTRANS_split = pd.read_sql("""
SELECT
MTRANS,
NObeyesdad,
count(*) as counts
FROM Obesity_data
GROUP BY 1,2
""",cnx)
MTRANS_split
| | MTRANS | NObeyesdad | counts |
|---|---|---|---|
| 0 | Automobile | Insufficient_Weight | 46 |
| 1 | Automobile | Normal_Weight | 45 |
| 2 | Automobile | Obesity_Type_I | 110 |
| 3 | Automobile | Obesity_Type_II | 95 |
| 4 | Automobile | Obesity_Type_III | 1 |
| 5 | Automobile | Overweight_Level_I | 66 |
| 6 | Automobile | Overweight_Level_II | 94 |
| 7 | Bike | Normal_Weight | 4 |
| 8 | Bike | Obesity_Type_II | 1 |
| 9 | Bike | Overweight_Level_I | 2 |
| 10 | Motorbike | Normal_Weight | 6 |
| 11 | Motorbike | Obesity_Type_I | 3 |
| 12 | Motorbike | Overweight_Level_I | 1 |
| 13 | Motorbike | Overweight_Level_II | 1 |
| 14 | Public_Transportation | Insufficient_Weight | 220 |
| 15 | Public_Transportation | Normal_Weight | 200 |
| 16 | Public_Transportation | Obesity_Type_I | 236 |
| 17 | Public_Transportation | Obesity_Type_II | 200 |
| 18 | Public_Transportation | Obesity_Type_III | 323 |
| 19 | Public_Transportation | Overweight_Level_I | 212 |
| 20 | Public_Transportation | Overweight_Level_II | 189 |
| 21 | Walking | Insufficient_Weight | 6 |
| 22 | Walking | Normal_Weight | 32 |
| 23 | Walking | Obesity_Type_I | 2 |
| 24 | Walking | Obesity_Type_II | 1 |
| 25 | Walking | Overweight_Level_I | 9 |
| 26 | Walking | Overweight_Level_II | 6 |
fig = px.bar(MTRANS_split, x="NObeyesdad", y="counts",
color='MTRANS',
height=400)
fig.show()
pd.crosstab(index=Obesity_data["NObeyesdad"],columns=Obesity_data["MTRANS"])
| MTRANS | Automobile | Bike | Motorbike | Public_Transportation | Walking |
|---|---|---|---|---|---|
| NObeyesdad | |||||
| Insufficient_Weight | 46 | 0 | 0 | 220 | 6 |
| Normal_Weight | 45 | 4 | 6 | 200 | 32 |
| Obesity_Type_I | 110 | 0 | 3 | 236 | 2 |
| Obesity_Type_II | 95 | 1 | 0 | 200 | 1 |
| Obesity_Type_III | 1 | 0 | 0 | 323 | 0 |
| Overweight_Level_I | 66 | 2 | 1 | 212 | 9 |
| Overweight_Level_II | 94 | 0 | 1 | 189 | 6 |
From both the bar chart above and the cross table, we see that there seems to be some correlation between the way people move about and their weight, e.g. a lot of people who walk are in the normal-weight category. A simple linear model could describe the relationship better, but all we know for now is that further analysis is needed.
The assumption here is that frequent consumption of high-caloric food will increase someone's likelihood of being obese.
# Understanding how consumption of high caloric foods is related to weight
favc_split = pd.read_sql("""
select NObeyesdad,
FAVC,
count(*) as counts
from Obesity_data
group by 1,2
""",cnx)
favc_split
| | NObeyesdad | FAVC | counts |
|---|---|---|---|
| 0 | Insufficient_Weight | no | 51 |
| 1 | Insufficient_Weight | yes | 221 |
| 2 | Normal_Weight | no | 79 |
| 3 | Normal_Weight | yes | 208 |
| 4 | Obesity_Type_I | no | 11 |
| 5 | Obesity_Type_I | yes | 340 |
| 6 | Obesity_Type_II | no | 7 |
| 7 | Obesity_Type_II | yes | 290 |
| 8 | Obesity_Type_III | no | 1 |
| 9 | Obesity_Type_III | yes | 323 |
| 10 | Overweight_Level_I | no | 22 |
| 11 | Overweight_Level_I | yes | 268 |
| 12 | Overweight_Level_II | no | 74 |
| 13 | Overweight_Level_II | yes | 216 |
fig = px.pie(favc_split, values='counts', names='FAVC')
fig.show()
Given the above visual and table, it is hard to identify a trend between consumption of high-caloric foods and weight gain. It seems like the majority of individuals in this dataset frequently consume high-caloric foods regardless of weight category.
fig = px.bar(favc_split, x="NObeyesdad", y="counts",
color='FAVC',barmode = 'group',
height=400)
fig.show()
Due to the fact that most of our data points consume high-caloric food frequently, it's hard to tell how much this variable influences our results. We would need to do some ratio comparison, but instead we can rely on the correlation below.
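As a quick sketch of the ratio comparison just mentioned, we can compute the within-category FAVC "yes" rate from the counts already shown in the favc_split table (hard-coded here for illustration):

```python
import pandas as pd

# (no, yes) FAVC counts per weight category, taken from the favc_split table.
favc_counts = {
    "Insufficient_Weight": (51, 221),
    "Normal_Weight": (79, 208),
    "Obesity_Type_I": (11, 340),
    "Obesity_Type_II": (7, 290),
    "Obesity_Type_III": (1, 323),
    "Overweight_Level_I": (22, 268),
    "Overweight_Level_II": (74, 216),
}
favc_rate = pd.DataFrame(
    [{"NObeyesdad": k, "favc_yes_rate": yes / (no + yes)}
     for k, (no, yes) in favc_counts.items()]
)
print(favc_rate.round(3))
```

Even though "yes" dominates everywhere, the rate is visibly higher in the obesity categories than in the normal-weight group.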
# Encode the response as binary: 0 = not obese, 1 = obese.
Obesity_data['WeightCatNumeric'] = np.select(
    [Obesity_data['NObeyesdad'].isin(["Insufficient_Weight", "Normal_Weight",
                                      "Overweight_Level_I", "Overweight_Level_II"]),
     Obesity_data['NObeyesdad'].isin(["Obesity_Type_I", "Obesity_Type_II",
                                      "Obesity_Type_III"])],
    [0, 1])
Obesity_data.corr()
| | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | WeightCatNumeric |
|---|---|---|---|---|---|---|---|---|---|
| Age | 1.000000 | -0.025958 | 0.202560 | 0.016291 | -0.043944 | -0.045304 | -0.144938 | -0.296931 | 0.217473 |
| Height | -0.025958 | 1.000000 | 0.463136 | -0.038121 | 0.243672 | 0.213376 | 0.294709 | 0.051912 | 0.137413 |
| Weight | 0.202560 | 0.463136 | 1.000000 | 0.216125 | 0.107469 | 0.200575 | -0.051436 | -0.071561 | 0.793652 |
| FCVC | 0.016291 | -0.038121 | 0.216125 | 1.000000 | 0.042216 | 0.068461 | 0.019939 | -0.101135 | 0.174848 |
| NCP | -0.043944 | 0.243672 | 0.107469 | 0.042216 | 1.000000 | 0.057088 | 0.129504 | 0.036326 | 0.037004 |
| CH2O | -0.045304 | 0.213376 | 0.200575 | 0.068461 | 0.057088 | 1.000000 | 0.167236 | 0.011965 | 0.097423 |
| FAF | -0.144938 | 0.294709 | -0.051436 | 0.019939 | 0.129504 | 0.167236 | 1.000000 | 0.058562 | -0.147097 |
| TUE | -0.296931 | 0.051912 | -0.071561 | -0.101135 | 0.036326 | 0.011965 | 0.058562 | 1.000000 | -0.082742 |
| WeightCatNumeric | 0.217473 | 0.137413 | 0.793652 | 0.174848 | 0.037004 | 0.097423 | -0.147097 | -0.082742 | 1.000000 |
Next, you should fit three different classes of models (they can be the ones we did in class or you can branch out). You can have a numeric response or a binary response.
With each model type you use, you should describe the general idea of the model/how it works. These discussions don't need to be super long, but they should be clear and hit the most important points about how the model works.
You should use CV to choose among the candidate models for each model type.
• You should set up a pipeline in pyspark for each of your models
• At least one of the pipelines should include at least two transformations prior to the model fit (estimator)
• You can use the same set of transformations for multiple models (if appropriate)
We will start our models with a simple linear regression to get a good grasp of what it is we are looking at. We will also utilize a LASSO model to pick the best variables for our model, with $\alpha$ as our tuning parameter.
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
# Get the numeric variables we will use as predictors
Obesity_data_numerics = Obesity_data[["Age", "Height", "Weight", "FCVC", "NCP", "FAF"]]
# Let's use the two categorical variables we saw some correlation in
transport_dummies = pd.get_dummies(Obesity_data.MTRANS)
history_dummies = pd.get_dummies(Obesity_data.family_history_with_overweight)
# Join numeric and categorical variables to be used
Obesity_data_x = Obesity_data_numerics.join(transport_dummies).join(history_dummies)
obesity_dummies = pd.get_dummies(Obesity_data.NObeyesdad)
Obesity_data_y = obesity_dummies.Obesity_Type_I + obesity_dummies.Obesity_Type_II + obesity_dummies.Obesity_Type_III
# Setting up an 80/20 train/test split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
Obesity_data_x,
Obesity_data_y,
test_size=0.20,
random_state=42)
Utilizing a LASSO model to get a general idea of what our variables look like. The idea here is that we can utilize a linear model but penalize it in the best way possible to fit our data. Further models will be more advanced/complex and hopefully will fit our data better.
# We can utilize LassoCV to identify the best alpha for our LASSO model below. This will help us weed out variables that do not meet our standard.
lasso_mod = LassoCV(cv=50, random_state=0) \
.fit(X_train,
y_train)
# Printing out our alpha value just to see it in our end result. This is the alpha value that will be used in the final LASSO fit.
lasso_mod.alpha_
0.010287014986838724
We see a somewhat small alpha value, meaning that our LASSO regression won't be too different from our linear regression. We hope this helps generalize our model to the point where we get the best results possible from a linear model. Next we will utilize other techniques which we suspect will fit our data better.
lasso = linear_model.Lasso(alpha=lasso_mod.alpha_)
lasso.fit(X_train, y_train)
Lasso(alpha=0.010287014986838724)
With a low value of alpha, our LASSO model penalizes the linear regression only slightly, meaning the LASSO fit is "very" close to a simple linear regression.
print(lasso.intercept_,lasso.coef_)
-0.8390072474223554 [ 0.00272712 -0. 0.01488881 0. -0.00382104 -0.04860682 -0. -0. -0. 0. -0. -0. 0. ]
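To illustrate the point that a small alpha leaves LASSO close to ordinary least squares, here is a hedged sketch on synthetic data (not the obesity set); the true coefficients and noise level are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: true coefficients are (1.5, 0.0, -2.0).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
lasso_small = Lasso(alpha=0.01).fit(X, y)

# With a small alpha, LASSO coefficients stay close to OLS, while the
# noise-only coefficient is shrunk toward (or exactly to) zero.
print(ols.coef_.round(3))
print(lasso_small.coef_.round(3))
```

This mirrors what we see above: a mild penalty keeps the informative coefficients essentially unchanged while pushing uninformative ones to zero.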
y_pred = lasso.predict(X_test)
RMSE_lasso = np.sqrt(mean_squared_error(y_test,y_pred))
print("RMSE for the LASSO model is: ", RMSE_lasso)
print("With an R2 of: ",r2_score(y_test, y_pred))
RMSE for the LASSO model is: 0.29284206338054114 With an R2 of: 0.6557717123216504
We see that we start with a low RMSE to begin with. This doesn't tell us much, since RMSE is used to compare models and currently we only have one model. Let's see if we can predict our data better with more complex/advanced models.
The $R^2$ here tells us that our model has a "decent" linear association with our response, meaning we are doing "OK" at predicting whether someone will be obese or not.
Logistic regression models the success probability (i.e. the average number of successes for given x values). The fitted function never goes below 0 or above 1, which suits a binary response where success is 1 and failure is 0.
The interpretation of the betas is a little different from linear regression: in logistic regression, each beta represents the change in the log-odds of the response for a unit change in its respective x.
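The log-odds interpretation can be checked numerically; this sketch uses made-up coefficients b0 and b1, not values fit to our data:

```python
import numpy as np

def sigmoid(z):
    # Maps any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for illustration only.
b0, b1 = -1.0, 0.5
x = 2.0
p_before = sigmoid(b0 + b1 * x)
p_after = sigmoid(b0 + b1 * (x + 1))

# A one-unit increase in x changes the log-odds by exactly b1.
log_odds_change = np.log(p_after / (1 - p_after)) - np.log(p_before / (1 - p_before))
print(log_odds_change)
```

Whatever x we start from, the change in log-odds per unit of x is always b1; the change in probability, by contrast, depends on where on the curve we are.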
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
log_reg = LogisticRegression(penalty="none", max_iter=55000, solver="saga")  # on scikit-learn >= 1.2 use penalty=None
cv_log = cross_validate(log_reg,
X_train,
y_train,
cv = 5)
A KNN model will help us estimate whether someone is obese based on the variables in the dataset. It does this by averaging the responses of the k nearest observations, where "nearest" is measured by Euclidean distance.
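A minimal sketch of that idea on a tiny made-up one-dimensional data set (not the obesity data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Four points; KNN predicts by averaging the targets of the k nearest points.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# The three nearest neighbors of x=1.5 are 0.0, 1.0 and 2.0,
# so the prediction is the mean of their targets (0, 0, 1).
pred = knn.predict([[1.5]])
print(pred)
```

The far-away point at 10.0 never enters the average, which is exactly what makes KNN a local method.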
knn_model = KNeighborsRegressor(n_neighbors=3)
knn_model.fit(X_train, y_train)
train_preds = knn_model.predict(X_train)
mse = mean_squared_error(y_train, train_preds)
rmse = sqrt(mse)
test_preds = knn_model.predict(X_test)
# Note: rmse here is computed on the training data, while R2 is on the test set.
print("RMSE for the KNN model (k=3) is: ", rmse)
print("With an R2 of: ",r2_score(y_test, test_preds))
RMSE for the KNN model (k=3) is:  0.10068224354547423 With an R2 of:  0.9240847092605886
knn_model = KNeighborsRegressor(n_neighbors=2)
knn_model.fit(X_train, y_train)
train_preds = knn_model.predict(X_train)
mse = mean_squared_error(y_train, train_preds)
rmse = sqrt(mse)
test_preds = knn_model.predict(X_test)
# Note: rmse here is computed on the training data, while R2 is on the test set.
print("RMSE for the KNN model (k=2) is: ", rmse)
print("With an R2 of: ",r2_score(y_test, test_preds))
RMSE for the KNN model (k=2) is:  0.08343201108647477 With an R2 of:  0.9051058865757358
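Rather than trying k=2 and k=3 by hand, cross-validation can choose k from a grid. This is a hedged sketch on synthetic stand-in data (the grid values and scoring choice are assumptions, not results from the obesity set):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for X_train / y_train.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

# 5-fold CV over a small grid of k values, scored by (negative) RMSE.
grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [2, 3, 5, 10, 20]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)
```

With the real training data, the same pattern would let cross-validated test error, rather than training RMSE, drive the choice of k.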